Linear regression model with histogram-valued variables

Authors

  • Sónia Dias
  • Paula Brito
Abstract

Histogram-valued variables are a particular kind of variables studied in Symbolic Data Analysis where to each entity under analysis corresponds a distribution that may be represented by a histogram or by a quantile function. Linear regression models for this type of data are necessarily more complex than a simple generalization of the classical model: the parameters cannot be negative; still the linear relation between the variables must be allowed to be either direct or inverse. In this work, we propose a new linear regression model for histogram-valued variables that solves this problem, named Distribution and Symmetric Distribution Regression Model. To determine the parameters of this model, it is necessary to solve a quadratic optimization problem, subject to non-negativity constraints on the unknowns; the error measure between the predicted and observed distributions uses the Mallows distance. As in classical analysis, the model is associated with a goodness-of-fit measure whose values range between 0 and 1. Using the proposed model, applications with real and simulated data are presented. © 2015 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2015
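The two ideas in the abstract that lend themselves to a small illustration are the Mallows distance between distributions and the reason negative coefficients are a problem. The sketch below is not the authors' implementation. It assumes each histogram has uniform mass within its bins, represents a distribution by its quantile function q(t), and approximates the Mallows distance sqrt(∫₀¹ (q1(t) − q2(t))² dt) with a midpoint rule. The function names (histogram_quantile, mallows_distance, is_valid_quantile) are hypothetical. The last helper shows that multiplying a quantile function by a negative number gives a decreasing function, which is no longer a quantile function, whereas a non-negative multiple of −q(1 − t), the quantile function of the sign-flipped variable, stays valid. Reading the model's "symmetric distribution" term this way is an assumption based on the abstract alone.

```python
import math


def histogram_quantile(edges, probs, t):
    """Quantile function of a histogram at level t in [0, 1].

    Assumes mass is spread uniformly inside each bin (an assumption of this
    sketch). `edges` has len(probs) + 1 increasing values; `probs` sums to 1.
    """
    cum = 0.0
    for lo, hi, p in zip(edges[:-1], edges[1:], probs):
        if p > 0 and t <= cum + p:
            # Linear interpolation inside the bin that contains level t.
            return lo + (t - cum) / p * (hi - lo)
        cum += p
    return edges[-1]


def mallows_distance(q1, q2, n_grid=1000):
    """Approximate sqrt(integral over [0, 1] of (q1(t) - q2(t))^2 dt)
    with the midpoint rule on n_grid equal sub-intervals."""
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) / n_grid
        total += (q1(t) - q2(t)) ** 2
    return math.sqrt(total / n_grid)


def is_valid_quantile(q, n_grid=200):
    """Check on a grid that q is non-decreasing, as a quantile function must be."""
    values = [q((k + 0.5) / n_grid) for k in range(n_grid)]
    return all(a <= b for a, b in zip(values, values[1:]))


# Two single-bin histograms: uniform on [0, 1] and uniform on [1, 2].
q_a = lambda t: histogram_quantile([0.0, 1.0], [1.0], t)
q_b = lambda t: histogram_quantile([1.0, 2.0], [1.0], t)

distance = mallows_distance(q_a, q_b)

# A negative coefficient breaks monotonicity ...
negative_scaled = lambda t: -2.0 * q_b(t)
# ... while a non-negative coefficient on -q(1 - t) keeps it.
symmetric_scaled = lambda t: 2.0 * -q_b(1.0 - t)
```

With building blocks like these, fitting the model would amount to minimizing a sum of squared Mallows distances over non-negative coefficients, which is the constrained quadratic problem the abstract refers to. The solver itself is left out of this sketch.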


Similar resources


Symbolic Data Analysis is concerned with data tables where the values in each cell are not single values but elements that express the variability of the records, e.g., intervals or histograms. Symbolic linear regression aims at investigating the linear relationship between histogram or interval-valued variables. In this paper, we study two real data problems: in a first one, symbolic models ar...


Linear regression for numeric symbolic variables: an ordinary least squares approach based on Wasserstein Distance

In this paper we present a linear regression model for modal symbolic data. The observed variables are histogram variables according to the definition given in Bock and Diday [1] and the parameters of the model are estimated using the classic Least Squares method. An appropriate metric is introduced in order to measure the error between the observed and the predicted distributions. In particula...


Lasso-based linear regression for interval-valued data

In regression analysis the relationship between one response and a set of explanatory variables is investigated. The (response and explanatory) variables are usually single-valued. However, in several real-life situations, the available information may be formalized in terms of intervals. An interval-valued datum can be described by the midpoint (its center) and the radius (its half width). Her...


How Robust Is Linear Regression with Dummy Variables

Researchers in education and the social sciences make extensive use of linear regression models in which the dependent variable is continuous-valued while the explanatory variables are a combination of continuous-valued regressors and dummy variables. The dummies partition the sample into groups, some of which may contain only a few observations. Such groups may easily include enough outliers t...


Testing linear independence in linear models with interval-valued data

Testing methods are introduced in order to determine whether there is some ‘linear’ relationship between imprecise predictor and response variables in a regression analysis. The variables are assumed to be interval-valued. Within this context, the variables are formalized as compact convex random sets, and an interval arithmetic-based linear model is considered. Then, a suitable equivalence for...



Journal:
  • Statistical Analysis and Data Mining

Volume 8

Publication date: 2015